ABSTRACT

In this work we consider the Bayesian Integrate And Shift (BIAS) model
for learning object categories and test its performance on learning and
recognizing different object categories from real-world images. In
contrast to conventional learning algorithms that require hundreds or
thousands of training examples, we show that our system can learn a new
object category from only a few examples. In addition, our system
provides information not only about the object category but also about
the local regions within the object on which it is fixating. We tested
the performance of the system on very challenging examples of partially
occluded targets. The system was trained on different instances of one
category and tested on partially occluded examples that it had never
seen before. We demonstrate that the system is very robust to partial
occlusions and clutter and can recognize a target even if it fixates on
the occluded part.

1.  INTRODUCTION 

Detection and identification of partially occluded targets in complex
scenes is becoming an increasingly important task in light of the latest
developments in urban warfare. A system that can automatically identify
selected targets, or direct soldiers' attention to locations that may
contain suspicious activity, can be of great use not only as a tool that
reduces the cognitive workload of the soldier but also as a tool that
alerts the soldier to possible threats.

Identifying a target in a complex scene is a challenging problem that
incorporates several important aspects of vision, including translation
and scale invariant recognition, robustness to noise, and the ability to
cope with significant variations in lighting conditions. Identifying an
occluded target adds another layer of complexity, and this problem can
be extremely difficult even for humans. Motion information can be of
great help in providing an initial figure-ground segmentation. However,
in many situations motion information is not available. In addition, if
the input to the system is a video stream, then the requirement that the
system work in real time often precludes the use of more sophisticated
but computationally involved techniques.

One of the main limitations of classical vision algorithms, such as
those utilizing Artificial Neural Networks (ANNs), Radial Basis
Functions (RBFs), and Support Vector Machines (SVMs), is that they
require a fixed-size input. This means that during the recognition phase
the input vector to the system has to be of the same size as the input
vector used during the training process. Such systems are therefore not
well suited for occlusion problems where sections of the input vector
are simply missing or carry incorrect information.

In addition, supplying a fixed-size input to the recognition system
requires the selection of a specific region from the image. This means
that such systems have to solve the segmentation problem, that is, find
the boundary of the region occupied by the target. However, given an
image, it is not known where the target is or what its size is. In order
to detect a target regardless of its location, the detection system is
usually (as presented in Schneiderman and Kanade (2000)) convolved over
the whole image, and in order to detect a target at different scales the
original image is rescaled and the convolution procedure repeated. Since
the methods that rely on exhaustive search are not computationally
efficient, they are mostly applied to detection of targets in static
images.

The human visual system, on the other hand, does not require any
“presegmentation” of the image in order to recognize a specific object.
In fact, when we look at an object, our visual system processes
information coming not only from the object itself but from the whole
scene. This is accomplished through an array of neurons that are
selective to specific features and whose receptive fields (RFs) are
spatially distributed and localized. Although our visual system
processes information from all the regions of the scene, it appears as
if it somehow knows to “discard” certain regions (the background) and
integrate only information from the object regions. If we are not able
to recognize an object from a single fixation, then we make saccades,
combine evidence from different fixations and as a result usually
improve our perception of the object.

Since our visual system integrates information from neurons that have
localized receptive fields, it seems natural to represent an object as a
collection of localized features. In contrast to global models, such as
those that use a Principal Components Analysis (PCA) approach,
feature-based approaches are much more robust to partial occlusions.
Over the past years, feature-based approaches have become increasingly
popular within the computer vision community (Lowe (1999); Schmid and
Mohr (1997); Serre et al. (2005); Heisele et al. (2001); Torralba et al.
(2004)). These approaches have been successfully used in various
applications such as face recognition (Schneiderman and Kanade (2000);
Viola and Jones (2001)), handwriting recognition (Wang et al. (2005);
Neskovic et al. (2000)), car detection (Agarwal et al. (2004);
Schneiderman and Kanade (2000); Neskovic et al. (2004)), and modeling
human bodies (Felzenszwalb and Huttenlocher (2005)). One of the problems
of probabilistic feature-based approaches (such as Fei-Fei et al.
(2003)) is that they cannot model an object with a large number of
features, since calculating the joint probabilities would require an
enormous amount of training data. Another problem is how to find the
best constellation of features. In the one-dimensional case this problem
can be solved using a dynamic programming approach, but for the
two-dimensional case it is still an open problem and no exact solution
that is at the same time computationally efficient exists today. In
contrast to the approaches presented in Fei-Fei et al. (2003) and Serre
et al. (2005), our model uses much simpler features and does not require
a feature learning stage. Furthermore, unlike the model of Fei-Fei et
al., our system can use an arbitrarily large number of features without
an increase in computational complexity.

The main question therefore is how to deal with computational complexity
when analyzing the large amounts of information contained in visual
scenes. It seems natural that in designing a system for scene analysis
we should use some properties of the best existing system for analyzing
visual scenes, the human visual system. Unfortunately, biologically
inspired models (Keller et al. (1999); Rybak et al. (1998)) and models
of biological vision (Amit and Mascaro (2003); Mel (1997); Riesenhuber
and Poggio (1999)) have been much less successful (in terms of
real-world applications) compared to computer vision approaches. A model
that captures some properties of human saccadic behavior and represents
an object as a fixed sequence of fixations has been proposed by Keller
et al. (1999). Similarly, Hecht-Nielsen and Zhou (1995) and Rybak et al.
(1998) presented models that are inspired by the scanpath theory (Noton
and Stark (1971)). Although these models utilize many behavioral,
psychological and anatomical concepts, such as separate processing and
representation of “what” (object features) and “where” (spatial
features: elementary eye movements) information, they still assume that
an object is represented as a sequence of eye movements. In contrast to
these approaches, our model does not assume any specific sequence of
saccades and therefore is more general.


In this work we consider the Bayesian Integrate And Shift (BIAS) model
(Neskovic et al. (2006)) for learning object categories and test its
performance on learning and recognizing different object categories from
real-world images. In contrast to conventional learning algorithms, such
as ANNs, that require hundreds or thousands of training examples, we
show that our system can learn a new object category from only a few
examples. In addition, our system provides information not only about
the object category but also about the local regions within the object
on which it is fixating. We tested the performance of the system on very
challenging examples of partially occluded targets. The system was
trained on different instances of one category and tested on partially
occluded examples that it had never seen before. We demonstrate that the
system is very robust to occlusions and clutter and can recognize a
target even if it fixates on the occluded part.

The paper is organized as follows. In section 2 we give an overview of
the BIAS model for learning new object categories. In section 3 we
discuss implementation details. In section 4 we illustrate the
performance of the system when tested on different object categories and
several instances of occluded faces. In section 5 we summarize the main
properties of our model and the impact of the system on the warfighter.

2.  THE  MODEL 

Our model falls into the category of feature-based approaches (Fei-Fei
et al. (2003); Lowe (1999); Torralba et al. (2004); Serre et al. (2005);
Schneiderman and Kanade (2000); Viola and Jones (2001)). The problem
that we want to solve is as follows: given a collection of features,
their locations $X$, and appearances $A$, we want to calculate the
probability that they represent an object of a specific class $n$,
$P(O_n \mid X, A)$. Since calculating this probability is extremely
difficult if the number of features is large, we seek suitable
approximations. One of the biggest simplifications is to assume that the
feature locations are fixed and that all the variations are due to
appearances. Unfortunately, this is one of the least reasonable
assumptions, and it holds in only a few practical situations.

In order to make the model more realistic, one should include tolerance
to variations in feature locations. Instead of assuming that a feature
is located at a point, we will assume that it is located within a
region. The question is how to design these regions. If we use large
regions, we can easily capture all possible variations in feature
locations (excellent generalization), but at the expense of losing
location specificity, which would decrease the discrimination capability
of the model. On the other hand, very small regions would provide
excellent localization but would lead to poor generalization. We propose
that the solution to this trade-off between generalization and retaining
location specificity is to use a retina-like distribution of regions in
combination with saccade-like shifts. If we want to estimate the
location of a specific feature, then the size of the region where it can
be found (the uncertainty) depends on the location of the point with
respect to which we measure its distance, the center. The further away
the feature is from that center, the larger the uncertainty. Therefore,
in order to capture variations in feature locations, the sizes of the
regions, as well as their overlaps, have to increase with their distance
from the center. As a consequence, the accuracy of estimating feature
locations is high only for the features that are close to the center. In
order to obtain good location estimates for the features that are
further away from the center, the recognition system would have to shift
the center, that is, to make a “saccade”.

Modeling an object. Let us now assume that we are given a large number
of regions that form a fixed grid and completely cover an input image.
We call each such region a receptive field (RF), and with it we
associate a group of feature detectors that signal the presence of the
features to which they are selective. This fixed mask of RFs can be
positioned anywhere in the image, and the location over which the
smallest RF is positioned is the fixation point. We will call a
configuration consisting of the outputs of feature detectors associated
with a specific fixation point a view. Since there can be as many views
as there are (fixation) points within the object, the number of views
can be extremely large even for objects of small sizes. In order to
reduce the number of views, we will assume that some views are
sufficiently similar to one another so that they can be clustered into
the same view. In this way, an object is modeled as a collection of
views and therefore has as many labels as there are views.

Notations. With the symbol $H$ we denote a random variable with values
$H = (n, i)$, where $n$ goes through all possible object classes and $i$
goes through all possible views within the object. Instead of $(n, i)$,
we use the symbol $H_i^n$ to denote the $i$th view of an object of the
$n$th (object) class. The background class, by definition, has only one
view. With the variable $y$ we measure the distances of the centers of
the RFs from the fixation point. The symbol $D_k^r$ denotes a random
variable that takes values from a feature detector that is positioned
within the RF centered at $y_k$ from the central location and is
selective to the feature of the $r$th (feature) class,
$D_k^r = d_r(y_k)$. The symbol $A_t$ denotes the outputs of all the
feature detectors for a given fixation point $x_t$ at time $t$. With the
variable $z$ we measure the distances of the previous fixation locations
(view centers) with respect to the location of the current fixation
point. For example, the symbol $z_{t-1}^j$ denotes the location of the
center of the $j$th view at time $t-1$. The collection of the locations
of all the view centers, up to time $t$, we denote with the symbol
$B_t$.

What we want to calculate is how spatial information, coming from
different feature detectors, as well as information from previous
fixations (the centers of the previous views), influences our
hypothesis, $p(H_i^n \mid A_t, B_t)$. In order to gain better insight
into these influences, we will start by including the evidence coming
from one feature detector and then increase the number of feature
detectors and fixation locations.

Combining information within a fixation. Let us now assume that for a
given fixation point $x_0$, the feature of the $r$th class is detected
with confidence $d_r(y_k)$ within the RF centered at $y_k$. The
influence of this information on our hypothesis, $H_i^n$, can be
calculated using Bayes' rule as

$$p(H_i^n \mid d_r(y_k), x_0) = \frac{p(d_r(y_k) \mid H_i^n, x_0)\, p(H_i^n \mid x_0)}{p(d_r(y_k) \mid x_0)}, \tag{1}$$

where the normalization term indicates how likely it is that the same
output of the feature detector can be obtained (or “generated”) under
any hypothesis,
$p(d_r(y_k) \mid x_0) = \sum_{n,i} p(d_r(y_k) \mid H_i^n, x_0)\, p(H_i^n \mid x_0)$.
We will now assume that a feature detector with RF centered around $y_q$
and selective to the feature of the $p$th class outputs the value
$d_p(y_q)$. The influence of this new evidence on the hypothesis can be
written as

$$p(H_i^n \mid d_p(y_q), d_r(y_k), x_0) = \frac{p(d_p(y_q) \mid d_r(y_k), H_i^n, x_0)\, p(H_i^n \mid d_r(y_k), x_0)}{p(d_p(y_q) \mid d_r(y_k), x_0)}. \tag{2}$$

The main question is how to calculate the likelihood term
$p(d_p(y_q) \mid d_r(y_k), H_i^n, x_0)$. In principle, if the pattern
does not represent any object but just a random background image, the
outputs of the feature detectors $d_p(y_q)$ and $d_r(y_k)$ are
independent of each other. If, on the other hand, the pattern represents
a specific object, say an object of the $n$th class, then the local
regions of the pattern within the detectors' RFs, and therefore the
features that capture the properties of those regions, are not
independent of each other:
$p(d_p(y_q) \mid d_r(y_k), H^n, x_0) \neq p(d_p(y_q) \mid H^n, x_0)$.
However, once we introduce a hypothesis of a specific view, the features
become much less dependent on one another. This is because the
hypothesis $H_i^n$ is much more restrictive and at the same time more
informative than the hypothesis about only the object class, $H^n$.
Given the hypothesis $H^n$, each feature depends both on the locations
of other features and on the confidences with which they are detected
(outputs of feature detectors). The hypothesis $H_i^n$ significantly
reduces the dependence on the locations of other features, since it
provides information about the exact location of each feature within the
object up to the uncertainty given by the size of the feature's RF.

The likelihood term, under the independence assumption, can therefore be
written as
$p(d_p(y_q) \mid d_r(y_k), H_i^n, x_0) = p(d_p(y_q) \mid H_i^n, x_0)$.
Note that this property is very important from a computational point of
view and allows for a very fast training procedure. The dependence of
the hypothesis on the collection of outputs of feature detectors $A_0$
can be written as

$$p(H_i^n \mid A_0, x_0) = \frac{\prod_{r,k \in A_0} p(d_r(y_k) \mid H_i^n, x_0)\, p(H_i^n \mid x_0)}{\sum_{n,i} \prod_{r,k \in A_0} p(d_r(y_k) \mid H_i^n, x_0)\, p(H_i^n \mid x_0)}, \tag{3}$$

where $r, k$ goes over all possible feature detector outputs contained
in the set $A_0$ and $n, i$ goes over all possible hypotheses.
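
As an illustration of Eq. (3), the following Python sketch multiplies
per-detector likelihoods under each view hypothesis and normalizes over
all hypotheses. It is a minimal sketch, not the authors' implementation:
the function names, the dictionary layout, and the use of log
probabilities (to avoid numerical underflow with many detectors) are
assumptions made here, and the Gaussian form of the likelihood
anticipates the modeling choice described in the implementation section.

```python
import math


def gaussian_log_pdf(value, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return (-0.5 * math.log(2.0 * math.pi * std * std)
            - (value - mean) ** 2 / (2.0 * std * std))


def view_posterior(outputs, models, priors):
    """Posterior over view hypotheses, following the structure of Eq. (3).

    outputs: dict detector_id -> observed detector output d_r(y_k)
    models:  dict hypothesis -> dict detector_id -> (mean, std)
    priors:  dict hypothesis -> prior probability (must be positive)
    All names and containers are hypothetical.
    """
    log_scores = {}
    for hyp, params in models.items():
        total = math.log(priors[hyp])
        # Independence given the view hypothesis: likelihoods multiply,
        # so their logarithms add.
        for det, value in outputs.items():
            mean, std = params[det]
            total += gaussian_log_pdf(value, mean, std)
        log_scores[hyp] = total
    # Normalize over all hypotheses (the denominator of Eq. (3)).
    peak = max(log_scores.values())
    norm = sum(math.exp(s - peak) for s in log_scores.values())
    return {h: math.exp(s - peak) / norm for h, s in log_scores.items()}
```

Because each detector contributes one additive term per hypothesis, the
cost grows linearly with the number of detectors, which is in line with
the fast training and evaluation noted above.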

Combining information across fixations. We now calculate how the
evidence about the locations of different fixations influences the
confidence about the specific hypothesis, $H_j^n$, associated with
fixation point $x_t$. We assume that at time $t-1$ a hypothesis has been
made that the fixation at distance $z_{t-1}^i$ from the current fixation
represented the center of the $i$th view of the object of the $n$th
class. Similarly, we will assume that at time $t-2$ a hypothesis has
been made that the fixation at distance $z_{t-2}^k$ from the current
fixation represented the center of the $k$th view. We denote with the
symbol $A_t$ the outputs of all the feature detectors that are used to
calculate the (new) hypothesis $H_j^n$. The influence of the evidence
about the locations of the previous hypotheses on the current hypothesis
can be written as

$$p(H_j^n \mid z_{t-1}^i, z_{t-2}^k, A_t, x_t) = \frac{p(z_{t-1}^i \mid H_j^n, z_{t-2}^k, A_t, x_t)\, p(H_j^n \mid z_{t-2}^k, A_t, x_t)}{p(z_{t-1}^i \mid z_{t-2}^k, A_t, x_t)}. \tag{4}$$

In  order  to  make  the  model  computationally  tractable,  we 
will  assume  that  the  view  locations  are  independent  from 
one  another  given  the  hypothesis. 

Since the location of the $k$th view of the object does not depend on
the configuration of feature detectors that is associated with the
current view, and assuming that view locations are independent of one
another, the likelihood term from Eq. (4) becomes
$p(z_{t-1}^i \mid H_j^n, z_{t-2}^k, A_t, x_t) = p(z_{t-1}^i \mid H_j^n, x_t)$.
The probability that the input pattern represents the $j$th view of the
object of the $n$th class, given the activations of the feature
detectors $A_t$ and the locations of other views $B_t$, can be written
as

$$p(H_j^n \mid A_t, x_t, B_t, f(s)) = \frac{\prod_{s<t} p(z_s^{f(s)} \mid H_j^n, x_t)\, p(H_j^n \mid A_t, x_t)}{\sum_i \prod_{s<t} p(z_s^{f(s)} \mid H_i^n, x_t)\, p(H_i^n \mid A_t, x_t)}, \tag{5}$$

where $i$ goes through the views of the $n$th object, $s$ goes through
the locations of all the fixations, and the function $f(s)$ maps a
location $z_s$ to a specific hypothesis. With the symbol $B_t$ we denote
the set of the locations of all the fixations (object views) with
respect to the location of the current fixation, $x_t$. The second term
in the numerator is calculated using Eq. (3).
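
The reweighting in Eq. (5) can be sketched in Python as follows. This is
an illustrative sketch under stated assumptions rather than the authors'
code: the function and argument names are hypothetical, the location
likelihood is simplified to a two-dimensional Gaussian with diagonal
covariance (the paper uses a full covariance matrix), and the
within-fixation posteriors are assumed to be strictly positive.

```python
import math


def location_log_likelihood(offset, mean, variances):
    """Log density of a 2-D Gaussian with diagonal covariance (a simplification)."""
    total = 0.0
    for z, m, v in zip(offset, mean, variances):
        total += -0.5 * math.log(2.0 * math.pi * v) - (z - m) ** 2 / (2.0 * v)
    return total


def combine_fixations(view_posteriors, previous, location_models):
    """Reweight within-fixation posteriors by earlier fixation locations.

    view_posteriors: dict view -> p(H_view | A_t, x_t), e.g. from Eq. (3)
    previous:        list of (earlier_view, offset) pairs, with the offset
                     measured from the current fixation point
    location_models: dict (view, earlier_view) -> (mean_offset, variances)
    """
    log_scores = {}
    for view, prob in view_posteriors.items():
        total = math.log(prob)
        # View locations are assumed independent given the hypothesis,
        # so each earlier fixation adds one log-likelihood term.
        for earlier_view, offset in previous:
            mean, variances = location_models[(view, earlier_view)]
            total += location_log_likelihood(offset, mean, variances)
        log_scores[view] = total
    # Normalize over the candidate views (the denominator of Eq. (5)).
    peak = max(log_scores.values())
    norm = sum(math.exp(s - peak) for s in log_scores.values())
    return {v: math.exp(s - peak) / norm for v, s in log_scores.items()}
```

For example, if the current fixation is equally likely to be either eye
on appearance alone, an earlier fixation labeled as the nose and found
at the offset expected for one eye shifts nearly all of the probability
to that eye.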


3.  IMPLEMENTATION 


Modeling Likelihoods. We model the likelihoods in Eq. (3) using Gaussian
distributions. The probability that the output of the feature detector
representing the feature of the $r$th class and positioned within the RF
centered at $y_k$ has a value $d_r(y_k)$, given a specific hypothesis
and the location of the fixation point, is calculated as

$$p(d_r(y_k) \mid H_i^n, x_t) = \frac{1}{\sqrt{2\pi}\,\sigma_k^r} \exp\!\left(-\frac{(d_r(y_k) - \mu_k^r)^2}{2(\sigma_k^r)^2}\right). \tag{6}$$


This notation for the mean and the variance assumes a particular
hypothesis, so we omitted some indices, $\sigma_k^r = \sigma_k^r(n, i)$.
The values for the mean and variance are calculated in batch mode but,
as we will see in the next section, only a small number of instances are
used for training, so the memory requirement is minimal. For modeling
the location likelihoods in Eq. (5) we use multivariate Gaussian
distributions, since in this case the mean location is a vector and the
variance is a covariance matrix. Note also the difference between
measuring the location of the center of a specific RF, $y_k$, and
measuring the location of a fixation point, $z_k$. Although both
distances are calculated with respect to the same reference point (the
fixation point), the locations of the RFs form a fixed grid while the
locations of fixation points can vary continuously.
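
A batch estimate of the Gaussian parameters for one feature detector
under one view hypothesis, and the resulting likelihood, might look like
the following Python sketch. The function names and the `min_std` floor
are assumptions introduced here; the floor is only a guard against a
zero variance when very few training samples are available and is not
described in the paper.

```python
import math


def fit_detector(samples, min_std=1e-3):
    """Batch estimate of mean and standard deviation of one detector's outputs.

    min_std is a hypothetical floor that keeps the density finite when the
    few available samples happen to be identical.
    """
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return mean, max(math.sqrt(variance), min_std)


def detector_likelihood(value, mean, std):
    """One-dimensional Gaussian density used as the detector likelihood."""
    norm = math.sqrt(2.0 * math.pi) * std
    return math.exp(-(value - mean) ** 2 / (2.0 * std * std)) / norm
```

Only the sampled detector outputs for the current view need to be held
in memory while the two statistics are computed.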

Feature Detectors and Receptive Fields. In this work we extract features
using a collection of Gabor filters, where the Gabor function that we
use is described by the following equation:

$$\psi_{f_0,\theta,\sigma}(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{4(x\cos\theta + y\sin\theta)^2 + (y\cos\theta - x\sin\theta)^2}{8\sigma^2}\right) \sin\!\big(2\pi f_0 (x\cos\theta + y\sin\theta)\big). \tag{7}$$

The inspiration for selecting these features comes from the fact that
simple cells in the visual cortex can be modeled by Gabor functions, as
shown by Marcelja (1980) and Daugman (1980).
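
The Gabor function of Eq. (7) can be evaluated directly. The Python
sketch below reads Eq. (7) as a sine-phase (odd) carrier under a
Gaussian envelope that is twice as wide across the carrier direction as
along it, with a normalization of $1/(\sqrt{2\pi}\sigma)$; the function
names and the kernel sampling on an integer pixel grid are assumptions
made for illustration.

```python
import math


def gabor(x, y, f0, theta, sigma):
    """Sine-phase Gabor value at (x, y) for frequency f0, orientation theta, scale sigma."""
    u = x * math.cos(theta) + y * math.sin(theta)   # along the carrier
    v = y * math.cos(theta) - x * math.sin(theta)   # across the carrier
    envelope = math.exp(-(4.0 * u * u + v * v) / (8.0 * sigma * sigma))
    carrier = math.sin(2.0 * math.pi * f0 * u)
    return envelope * carrier / (math.sqrt(2.0 * math.pi) * sigma)


def gabor_kernel(size, f0, theta, sigma):
    """Sample the Gabor function on a size x size grid centered at the origin.

    size is assumed to be odd so that the grid has a central pixel.
    """
    half = size // 2
    return [[gabor(x, y, f0, theta, sigma) for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]
```

Because the carrier is a sine, the kernel is odd-symmetric about its
center and responds to edges rather than to uniform brightness.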

One way to constrain the values of the free parameters in Eq. (7) is to
use information from neurophysiological data on simple cells, as
suggested by Lee (1996). More specifically, the relation between the
spatial frequency and the bandwidth can be derived to be
$2\pi f_0 \sigma = 2\sqrt{\ln 2}\,(2^{\phi} + 1)/(2^{\phi} - 1)$, where
$\phi$ is the bandwidth in octaves (see Lee (1996) for more detail).
Since the spatial frequency bandwidths of the simple and complex cells
have been found to range from 0.5 to 2.5 octaves, clustering around 1.2
octaves, we set $\phi$ to 1.5 octaves. The orientations and sizes of the
filters are set to $\theta = \{0, \pi/4, \pi/2, 3\pi/4\}$ and
$\sigma = \{2, 4, 6, 8\}$. Each RF has a square form and the size of the
smallest RF is 31x31 pixels. The RFs are arranged along 8 directions and
the sizes of the RFs are increased at the ratio of 1.4 (controlled by
the enlarge parameter). For example, the sizes of the RFs that are
nearest neighbors to the central RF are (31x1.4)x(31x1.4). The overlap
between two neighboring receptive fields is 50%, meaning that for two
neighboring RFs, the larger RF covers 50% of the area of the smaller
receptive field. The recognition results are not very sensitive to small
changes in the overlap, the enlarge parameter, and the sizes of the
receptive fields.
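
The growth of the RF sizes can be tabulated with a few lines of Python.
This is only a sketch: the paper gives the smallest size (31 pixels),
the enlarge ratio (1.4) and the 8 directions, but not the number of
rings around the central RF, so the default `rings=8` below is a
hypothetical value chosen for illustration, as are the function names.

```python
def rf_sizes(smallest=31.0, enlarge=1.4, rings=8):
    """Side lengths (in pixels) of the central RF and of each surrounding ring.

    rings=8 is an assumed value; the source does not state it.
    """
    return [smallest * enlarge ** i for i in range(rings + 1)]


def rf_count(rings=8, directions=8):
    """Number of RFs in the mask: one central RF plus one per ring and direction."""
    return 1 + rings * directions
```

With these assumed defaults the first ring has RFs of side 31 x 1.4 =
43.4 pixels, matching the example in the text, and the mask contains 65
RFs.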

With  each  RF  we  associate  16  feature  detectors  where 
each  feature  detector  signals  the  presence  of  a  feature  (i.e. 
a  Gabor  filter  of  specific  orientation  and  size)  to  which  it  is 
selective  no  matter  where  the  feature  is  within  its  receptive 
field.  One  way  to  implement  this  functionality  is  to  use  a 
max  operator.  The  processing  is  done  in  the  following  way. 
On  each  region  of  the  image,  covered  by  a  specific  RF,  we 
apply  a  collection  of  16  Gabor  filters  (4  orientations  and 
4  sizes)  and  obtain  16  maps.  Each  map  is  then  supplied 
to  a  corresponding  feature  detector  and  the  feature  detector 
then  finds  a  maximum  over  all  possible  locations.  As  a 
result,  each  feature  detector  finds  the  strongest  feature  (to 
which  it  is  selective)  within  its  RF  but  does  not  provide  any 
information about the location of that feature. With 16 detectors per
RF, the number of features that our system uses is over 1,000.
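
The max operator described above can be sketched as follows in Python.
The helper names are hypothetical, and the plain sliding-window
correlation stands in for whatever filtering routine is actually used;
the point of the sketch is only that the detector output keeps the
strongest response and discards where it occurred.

```python
def correlate_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(cols)]
            for i in range(rows)]


def detector_output(response_map):
    """Max operator: the strongest response within the RF, position discarded."""
    return max(value for row in response_map for value in row)
```

Shifting a feature inside the RF moves the peak of the response map but
leaves the detector output unchanged, which is the location tolerance
the text describes.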

The Training Procedure. The training is done in a supervised way. We
constructed an interactive environment that allows the user to mark a
section of an object and label it as a fixation region associated with a
specific view. Therefore, every point within this region can serve as
the view center. Once the user marks a specific region, the system
samples the points within it and calculates the mean and variance for
each feature detector. Since the number of training examples is small,
the training is very fast. Note that during the training procedure the
input to the system is the whole image and the system learns to
discriminate between an object and the background. It is important to
stress that the system does not learn parts of the object, but the whole
object from the perspective of the specific fixation point.

4.  RESULTS 

We tested the performance of our system on four object categories
(faces, cars, airplanes and motorcycles) using the Caltech database as
in Fei-Fei et al. (2003) and Serre et al. (2005). As a performance
measure we used the error rate at the equilibrium point (EP), which is
calculated by setting the threshold so that the miss rate is equal to
the false positive rate. We chose this measure over the Receiver
Operating Characteristic (ROC) curve since it provides a more compact
representation of the results, in the sense that much more information
can be represented in one graph than with the ROC measure. For
illustrative purposes, we chose the face category to present some of the
properties of our system in more detail.
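
The equilibrium-point error rate can be computed from classifier scores
as in the Python sketch below. It assumes that higher scores mean
"target present" and, since miss and false-positive rates are step
functions of the threshold, it picks the observed threshold at which the
two rates are closest and reports their average; these conventions and
the function name are assumptions, not details given in the paper.

```python
def equilibrium_error(pos_scores, neg_scores):
    """Error rate at the threshold where miss rate and false-positive rate are closest.

    A sample is classified as positive when its score is >= threshold.
    """
    best_gap, best_error = None, None
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        miss = sum(s < threshold for s in pos_scores) / len(pos_scores)
        false_pos = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        gap = abs(miss - false_pos)
        if best_gap is None or gap < best_gap:
            best_gap, best_error = gap, (miss + false_pos) / 2.0
    return best_error
```

A single number per condition is what allows several training-set sizes
or views to be plotted together in one graph.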

The system was first trained on background images in order to learn the
“background” hypothesis. We used 20 random images and within each image
the system made fixations at 100 random locations. The system was then
trained on specific views of specific objects. For example, in training
the system to learn the face from the perspective of the right eye, the
user marks with the cursor the region around the right eye and the
system then makes fixations within this region in order to learn it.

Fig. 1. View regions as selected by the teacher.


Fig. 2. Performance graphs for different views of the face (panel title:
“FACE - different views”). The task is to verify whether an image
contains a face and to estimate the locations of different views within
the face.

During the testing phase, the system makes random fixations and for each
fixation point we calculate the probability that the configuration of
the outputs of feature detectors represents a face from the perspective
of the right eye. To make sure that there are also positive examples
among the random fixations, each testing image is divided into the view
region(s) (in this case the right eye region) and the rest of the image,
which represents the “background” class. Therefore, positive examples
consisted of random fixations within the region of the right eye and
negative examples consisted of random fixations outside the region of
the right eye. The system was tested on instances that were not used for
training. We used 200 positive examples and 1000 negative examples for
testing. In all of the experiments that follow, we set the number of
fixations per view (the number of sampling points) to 10.

Table 1. Faces - Multi Views

View      1        2        3        4
r. eye    98.0%    99.0%    99.5%    100%
l. eye    94.5%    99.5%    100%     100%
nose      90.5%    94.0%    96.5%    97.5%
mouth     92.0%    97.5%    98.5%    99.0%

Table 2. Cars - Multi Views

View      1        2        3        4
v1        88.2%    91.1%    94.2%    95.8%
v2        86.6%    91.5%    93.3%    94.0%
v3        88.9%    90.4%    93.0%    94.9%
v4        84.4%    90.7%    92.5%    93.2%


Fig. 3. Performance graphs for three different objects using two
different views (panel title: “3 objects, 2 views”). The task is to
recognize a specific object category and estimate the locations of
different views within the object.

As illustrated in Figure 2, the system can easily learn a new view from
only a few training examples. Since the system was not able to learn
much from one example, we set the performance to zero for one training
example. In order to learn the face (and “discard” information from the
background) it has to be presented with more than one training example.
As it turns out, two examples are not quite enough, as can be clearly
seen in Figure 2, but with three examples the system can learn the new
face (the specific view of the face) with high confidence.

In Figure 3 we show that the system can easily learn classes other than
faces. For each class we used two views as illustrated in Figure 1. The
system was tested on instances it has never seen before. The task was to
detect an object within the image and estimate the locations of the
specific views.

Fig. 4. Top: Three different occlusions used for testing the system on a
face it has never seen before. The task is to detect a face and estimate
the location of the right eye. Bottom: Corresponding performances under
different occlusions.

Although the performance of the system is very good
using only a single view, we tested whether, and by how much,
information from other fixations improves the performance
(Tables 1 and 2). The tests were done on faces and cars; we
used 4 very good views for faces and 4 below-average views
for cars. In both cases the information about the spatial
location of other views improved the performance. During the
training phase, the user marks the fixation (view) regions,
and the system then calculates the location likelihoods for
each pair of regions separately by randomly selecting n
points from each region. During the testing phase, in order
to estimate the location of the view center, the system
selects the 10 points with the highest probabilities (of
representing the view) and takes the average of their
locations. This average location is then taken by the system
as the view center.
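The view-center estimate used during testing (average the locations of the 10 most probable points) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `estimate_view_center` and the representation of points and probabilities as plain lists are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def estimate_view_center(points: List[Point], probs: List[float],
                         k: int = 10) -> Point:
    """Average the locations of the k points with the highest
    probability of representing the view.

    `points` and `probs` are parallel lists (hypothetical layout);
    if fewer than k points are available, all of them are used.
    """
    if len(points) != len(probs) or not points:
        raise ValueError("points and probs must be non-empty and equal length")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Rank points by probability, highest first, and keep the top k.
    ranked = sorted(zip(probs, points), key=lambda pair: pair[0], reverse=True)
    top = [pt for _, pt in ranked[:k]]

    # The estimated center is the mean of the selected locations.
    cx = sum(x for x, _ in top) / len(top)
    cy = sum(y for _, y in top) / len(top)
    return (cx, cy)
```

Averaging over several high-probability points, rather than taking the single best point, makes the estimate less sensitive to one spurious high response.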

The  system  was  first  trained  on  individual  views  and  the 
results  are  illustrated  in  column  1  of  Tables  1  and  2.  When 
the  system  used  information  about  the  location  of  one  more 
view,  the  performance  improved,  as  shown  in  column  2. 
Fig. 5. Performance of the system on different face
occlusions. Yellow stars denote correctly detected positive
fixations, green stars denote correctly detected negative
fixations, red stars denote missed fixations, and blue stars
denote false alarm fixations.

Note that this information is not provided by the teacher
but estimated by the system. This means that the recognition
rates can decrease if the system erroneously estimates
the view centers. However, utilizing information about the
location of two different views always improved the
performance, as captured by the numbers in column 3. The
best performance, as expected, was obtained when the system
used the information about the centers of the three other
views, as shown in column 4.

Since our system uses information from over 1,000 feature
detectors distributed over the whole image, it is very
robust to occlusions. This is demonstrated in Figure 4
(bottom), which shows the performance of the system when
tested on the occluded images shown in Figure 4 (top). The
system was trained to recognize the face category from the
perspective of the right eye (right-eye view) using
(non-occluded) examples from different people.

The system was tested on three types of occlusions:
a) a bar covering both eyes (denoted as cover 1 in
Figure 4), b) two large disconnected regions covering the face
(cover 2), and c) a rectangle covering the face below the
nose (cover 3). Tests were done on face images of people
that were not used for training. As one can see, the system
can recognize the face even when the fixation region
is covered (Figure 5, top), which means that it utilizes
information from the whole face and not only local
information around the fixation point. In Figure 5 we use
yellow stars to display correctly detected positive
fixations, green stars for correctly detected negative
fixations, red stars for missed fixations, and blue stars for
false alarm fixations. Incorrectly classified fixations are
drawn larger.
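The four marker colors correspond to the four possible outcomes of a detection decision at a fixation. A small sketch of that mapping follows; the function and dictionary names are hypothetical and only restate the color convention given above.

```python
from collections import Counter
from typing import Iterable, Tuple

# Color convention taken from the Figure 5 description.
MARKER_COLOR = {
    "hit": "yellow",               # positive fixation, correctly detected
    "correct_rejection": "green",  # negative fixation, correctly rejected
    "miss": "red",                 # positive fixation, not detected
    "false_alarm": "blue",         # negative fixation, wrongly detected
}


def fixation_outcome(is_target: bool, detected: bool) -> str:
    """Name the outcome for one fixation given ground truth and decision."""
    if is_target:
        return "hit" if detected else "miss"
    return "false_alarm" if detected else "correct_rejection"


def tally_outcomes(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count outcomes over (is_target, detected) pairs."""
    return Counter(fixation_outcome(t, d) for t, d in results)


def is_error(outcome: str) -> bool:
    """Misses and false alarms are the markers drawn larger."""
    return outcome in ("miss", "false_alarm")
```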


5.  CONCLUSIONS 

In this work we considered the Bayesian Integrate And
Shift (BIAS) model (Neskovic et al., 2006) for learning
object categories and tested its performance on learning and
recognizing different object categories from real-world
images. In contrast to conventional learning algorithms, such
as ANNs, that require hundreds or thousands of training
examples, we showed that our system can learn a new object
category from only a few examples. In addition, our
system provides information not only about the object
category but also about the local regions within the object on
which it is fixating.

We tested the performance of the system on very
challenging examples of partially occluded targets. The
training was done on different instances of one category and
tested on partially occluded examples that the system had
never seen before. We demonstrated that the system is very
robust to partial occlusions and clutter and can recognize a
target even if it fixates on the occluded part.

The benefit of this system to the soldier will be twofold:
it will reduce the cognitive workload of the soldier
operating in complex visual environments (such as those
encountered in urban combat), and it will alert the soldier to
possible threats that might otherwise be overlooked due to
partial occlusions. We believe that a system for automatic
detection of concealed and partially occluded targets
will have a significant impact on the warfighter, especially
in light of the latest developments in urban warfare.